Alexandre Strube // Sabrina Benassou
February 28, 2023
| Time | Title |
|---|---|
| 09:00 - 09:15 | Welcome |
| 09:15 - 10:00 | Introduction |
| 10:00 - 10:15 | Coffee Break |
| 10:00 - 10:30 | Judoor, Keys |
| 10:30 - 11:00 | SSH, Jupyter, VS Code |
| 11:00 - 11:15 | Coffee Break |
| 11:15 - 12:00 | Running services on the login and compute nodes |
| 12:00 - 12:15 | Coffee Break |
| 12:30 - 13:00 | Sync (everyone should be at the same point) |
TL;DR: 89856 cores, 3744 GPUs, 468 TB RAM 💪
Way deeper technical info at Juwels Booster Overview
TL;DR: Smaller than JUWELS Booster, but still packs a punch 🤜
Way deeper technical info at JUSUF Overview
training2303$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The keys randomart image is:
+--[ED25519 256]--+
| *++oo=o. . |
| . =+o .= o |
| .... o.E..o|
| . +.+o+B.|
| S =o.o+B|
| . o*.B+|
| . . = |
| o . |
| . |
+----[SHA256]-----+
code $HOME/.ssh/config
Host jusuf
HostName jusuf.fz-juelich.de
User [MY_USERNAME]
IdentityFile ~/.ssh/id_ed25519-JSC
Host booster
HostName juwels-booster.fz-juelich.de
User [MY_USERNAME]
IdentityFile ~/.ssh/id_ed25519-JSC
Copy contents to the config file and save it.
curl ifconfig.me
(Ignore the % sign)
Run code key.txt and paste the number you got
Say it is 93.199.55.160: replace the last two numbers with 0.0/16, which gives 93.199.0.0/16
Put from="" around it: from="93.199.0.0/16"
Add ,10.0.0.0/8 inside the quotes, right after 93.199.0.0/16 🧙‍♀️
Run cat ~/.ssh/id_ed25519-JSC.pub and paste its output into key.txt, which you just opened
This might take some minutes
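The address arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `from_clause` and its `extra_ranges` parameter are made up for illustration, and the example address is the one used in the steps above.

```python
import ipaddress


def from_clause(public_ip: str, extra_ranges=("10.0.0.0/8",)) -> str:
    """Build the from="..." restriction for a given public IP address."""
    # strict=False accepts a host address and returns its enclosing /16 network,
    # which is the same as replacing the last two numbers with 0.0/16.
    network = ipaddress.ip_network(f"{public_ip}/16", strict=False)
    ranges = [str(network), *extra_ranges]
    # All ranges go inside one pair of quotes, separated by commas.
    return 'from="' + ",".join(ranges) + '"'


print(from_clause("93.199.55.160"))
```

Replace the example address with the number that `curl ifconfig.me` printed for you.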
That’s it! Give it a try (and answer yes)
$ ssh jusuf
The authenticity of host 'jusuf.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
**************************************************************************
* Welcome to JUSUF *
**************************************************************************
...
...
strube1@jusuf ~ $
# Create a shortcut for the project in the home folder
ln -s $PROJECT_training2303 ~/course
# Create a folder for myself
mkdir ~/course/$USER
# Enter the course folder
cd ~/course/$USER
# Where am I?
pwd
# We will need those later
mkdir ~/course/$USER/.cache
mkdir ~/course/$USER/.config
mkdir ~/course/$USER/.fastai
ln -s ~/course/$USER/.cache $HOME/
ln -s ~/course/$USER/.config $HOME/
ln -s ~/course/$USER/.fastai $HOME/
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
module avail (Inside hierarchy)
Stage (full collection of software of a given year)
Compiler
MPI
Module
E.g.:
module load Stages/2023 GCC OpenMPI PyTorch
module spider Software/version
Search for the software itself - it will suggest a version
Search with the version - it will suggest the hierarchy
(make sure you are still connected to JUSUF)
Oh noes! 🙈
Let’s bring Python together with PyTorch!
Copy and paste these lines
# This command fails, as we have no proper python
python
# So, we load the correct modules...
module load Stages/2023
module load GCC OpenMPI Python PyTorch
# And we run a small test: import torch and ask its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
module key
module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate your systems more effectively.
------------------------------------------------------------------------------------
From the ssh connection, navigate to your “course” folder and then to the folder with your name, which you created earlier.
This is our working directory. We do everything here.
Open “matrix.py” on VSCode on JUSUF
Paste this into the file:
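The contents of matrix.py did not survive in this text. As a hypothetical stand-in (not the course's original file), a minimal sketch could multiply two random matrices with PyTorch. The function name `multiply` and the matrix size are assumptions.

```python
import torch


def multiply(size: int = 1000):
    """Multiply two random square matrices, on the GPU if one is visible."""
    # Falls back to the CPU, e.g. on a login node without a usable GPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    product = a @ b  # matrix multiplication
    return product, device


if __name__ == "__main__":
    product, device = multiply()
    print(f"Multiplied two matrices on {device}; result shape: {tuple(product.shape)}")
```

Any short PyTorch script works for the steps that follow; the point is to have something that runs both on the login node and inside a batch job.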
module load Stages/2023
module load GCC OpenMPI PyTorch
python matrix.py
Simple Linux Utility for Resource Management
code jusuf-matrix.sbatch
#!/bin/bash -x
#SBATCH --account=training2303 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
#SBATCH --cpus-per-task=1 # How many cpus per mpi proc
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=gpus # Machine partition
#SBATCH --reservation=training-20230229 # For today only
module load Stages/2023
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
# Notice that this number is the job id. It's different for every job
cat output.412169
cat error.412169
Or simply open it on VSCode!
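The file names come from the `%j` placeholder in `--output=output.%j` and `--error=error.%j`, which Slurm replaces with the job id. A small sketch of that mapping, assuming the usual `Submitted batch job <id>` line that sbatch prints on submission (the helper name `job_files` is hypothetical):

```python
import re


def job_files(sbatch_stdout: str) -> tuple[str, str]:
    """Return the (output, error) file names for a freshly submitted job."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"No job id found in: {sbatch_stdout!r}")
    job_id = match.group(1)
    # Mirrors --output=output.%j and --error=error.%j from the batch script.
    return f"output.{job_id}", f"error.{job_id}"


print(job_files("Submitted batch job 412169"))
```

This prints the two file names used in the `cat` commands above.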
Jupyter-JSC calls Slurm, just the same as your job
When you are working on it, you are using compute node time
Yes, even if you are just thinking and looking at the 📺, you are burning project time 🤦‍♂️
It’s useful for small tests - not for full-fledged development
pip….
Link: MLflow quickstart
mlflow[extras]
fastai
./setup.sh
Activate the environment where MLflow is with
source ./activate.sh
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) Stages/2023
The following have been reloaded with a version change:
1) HDF5/1.12.2-serial => HDF5/1.12.2
python
Python 3.10.4 (main, Oct 4 2022, 08:48:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mlflow
>>> mlflow.__version__
'2.1.1'
import os
from random import random, randint
from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    # Log a parameter (key-value pair)
    log_param("param1", randint(0, 100))
    # Log a metric; metrics can be updated throughout the run
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)
    # Log an artifact (output file)
    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")
    log_artifacts("outputs")
#!/bin/bash -x
#SBATCH --account=training2303
#SBATCH --nodes=1
#SBATCH --job-name=mlflow-demo
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=output.%j
#SBATCH --error=err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=gpus
#SBATCH --reservation=training-20230229 # For today only
# Make sure we are in the right directory
cd /p/home/jusers/$USER/jusuf/course/$USER
# This loads modules and python packages
source sc_venv_template/activate.sh
# Run the demo
srun python mlflow-demo.py
Example: MLflow
mlflow ui --port 3000
ssh -L :1234:localhost:3000 jusuf
fastai-demo.py
from fastai.vision.all import *
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
fastai-demo.sbatch
#!/bin/bash -x
#SBATCH --account=training2303
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=output.%j
#SBATCH --error=err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=gpus
#SBATCH --reservation=training-20230229 # For today only
cd /p/home/jusers/$USER/jusuf/course/$USER
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python fastai-demo.py
error.${JOBID} file
  File "/p/software/jusuf/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/p/software/jusuf/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
srun: error: jsfc013: task 0: Exited with exit code 1
$ source sc_venv_template/activate.sh
$ python fastai-demo.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
(To exit, type CTRL-C)
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
Finished downloading dataset
epoch train_loss valid_loss error_rate time
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
....
Epoch 1/1 :
epoch train_loss valid_loss error_rate time
0 0.049855 0.021369 0.007442 00:42
Follow the example from mlflow.fastai
Add this at the beginning of your code:
Change the training line to this:
mlflow ui on the login node
from fastai.vision.all import *
import mlflow.fastai
from mlflow import MlflowClient
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=error_rate)
# Enable auto logging
mlflow.fastai.autolog()
# Start MLflow session
with mlflow.start_run() as run:
    learn.fine_tune(1)
As of now, I expect you managed to:
Type on your machine “code $HOME/.ssh/config” and paste
this at the end:
# -- Compute Nodes --
Host *.booster
User [ADD YOUR USERNAME HERE]
StrictHostKeyChecking no
IdentityFile ~/.ssh/id_ed25519-JSC
ProxyJump booster
Host *.jusuf
User [ADD YOUR USERNAME HERE]
StrictHostKeyChecking no
IdentityFile ~/.ssh/id_ed25519-JSC
ProxyJump jusuf
On the supercomputer:
srun --time=00:05:00 \
--nodes=1 --ntasks=1 \
--partition=gpus \
--account training2303 \
--cpu_bind=none \
--pty /bin/bash -i
bash-4.4$ hostname # This is running on a compute node of the supercomputer
jsfc013
bash-4.4$ cd $HOME/course/$USER
bash-4.4$ source sc_venv_template/activate.sh
bash-4.4$ mlflow ui
On your machine:
ssh -L :3334:localhost:5000 jsfc013i.jusuf
Mind the letter i I added at the end of the hostname
Now you can access the service on your local browser at http://localhost:3334
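The forwarding command follows a fixed pattern: local port, then the port the service listens on at the compute node, then the node name with an `i` appended and the cluster alias from the SSH config. A sketch that assembles it (the helper name `tunnel_command` and its parameters are made up for illustration):

```python
def tunnel_command(node: str, local_port: int, remote_port: int,
                   cluster: str = "jusuf") -> str:
    """Build the ssh port-forwarding command for a service on a compute node."""
    # The trailing "i" is appended to the node name printed by `hostname`;
    # the ".jusuf" suffix matches the "Host *.jusuf" entry in ~/.ssh/config.
    return f"ssh -L :{local_port}:localhost:{remote_port} {node}i.{cluster}"


print(tunnel_command("jsfc013", 3334, 5000))
```

With the values from this example it prints the same command as above. Swap in the node name that `hostname` printed for your own job.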